Make a highlighted time series

Time series comparisons can be handy for seeing trends in data. Here, we use duck observations to explore how COVID-19 lockdowns in Melbourne affected species data collection compared to previous years.

Authors

Thai Rushbrook
Olivia Torresan
Dax Kellie

Published

7 March 2023

Thanks to huge efforts by citizen scientists, a majority of species observations in the Atlas of Living Australia are collected opportunistically, where people record observations incidentally rather than through a recurring monitoring program.

However, whether an observation is recorded or not doesn’t just depend on the species. It might be rainy, it might be too hot, an area might be inaccessible; all of these factors can affect whether people make an observation.

The COVID-19 pandemic had a major impact on people’s health, behaviour and travel. In Australia, several lockdowns over 2020-2021 restricted people’s movements, requiring them to stay at home or limit their activities (mainly exercise) to their local area. Melbourne experienced the longest continuous lockdown in the world.

To what extent did COVID-19 and lockdowns affect the number of species observations people made over that time? Here, we’ll use a highlighted time-series plot to investigate how lockdowns affected the data collection of Anatidae observations (ducks, geese and swans), a group frequently recorded on walks and outdoor gatherings, in Melbourne compared to previous years.

Get data

We’ll start by downloading Anatidae records.

First, let’s load some packages:

# Load packages
library(galah)
library(tidyverse)
library(lubridate)
library(grid)
library(ggplot2)
library(pilot) # remotes::install_github("olihawkins/pilot")
library(ggtext)
library(showtext)

Let’s use the {galah} package to download duck records in Melbourne from 2017-2021, covering the years just before and during COVID-19 lockdowns.

Searching with search_all(fields, "city") shows us that {galah} contains a field for Greater Capital City Statistical Areas, which we can use to filter our query.

search_all(fields, "city")
# A tibble: 1 × 4
  id      description                                                type  link 
  <chr>   <chr>                                                      <chr> <chr>
1 cl10929 PSMA ABS Greater Capital City Statistical Areas (2016) AB… laye… http…
search_all(fields, "city") |> search_values("melbourne")
• Showing values for 'cl10929'.
# A tibble: 1 × 2
  field   category         
  <chr>   <chr>            
1 cl10929 GREATER MELBOURNE

Let’s build our query to return Anatidae records from GREATER MELBOURNE from 2017 to 2021. We’ll use galah_select() to return only the eventDate column.

You will need to first provide a registered email with the ALA using galah_config() before retrieving records.

# Add registered email (register at ala.org.au)
galah_config(email = "your-email@email.com")
ducks <-
  galah_call() |>
  galah_identify("Anatidae") |>
  galah_filter(
    cl10929 == "GREATER MELBOURNE",
    # year >= 2017,
    # year <= 2021,
    eventDate >= "2017-01-01T00:00:00Z",
    eventDate <= "2021-12-31T23:59:00Z",
    basisOfRecord == "HUMAN_OBSERVATION"
  ) |>
  galah_select(eventDate) |>
  atlas_occurrences()

ducks |> head(6L)
# A tibble: 6 × 1
  eventDate          
  <dttm>             
1 2017-01-01 13:00:00
2 2017-01-01 13:00:00
3 2017-01-01 13:00:00
4 2017-01-01 13:00:00
5 2017-01-01 13:00:00
6 2017-01-01 13:00:00

We’ll extract the week and year of each date and count the total occurrence records for each week.

ducks_weekly <- ducks |> 
  mutate(date = as_date(eventDate),
         year = year(eventDate),
         week = week(eventDate)) |>
  group_by(year, week) |>
  summarise(week_count = n())

ducks_weekly 
# A tibble: 265 × 3
# Groups:   year [5]
    year  week week_count
   <dbl> <dbl>      <int>
 1  2017     1        647
 2  2017     2        670
 3  2017     3        665
 4  2017     4        790
 5  2017     5        745
 6  2017     6        580
 7  2017     7        658
 8  2017     8        680
 9  2017     9        575
10  2017    10        541
# … with 255 more rows

We intend to compare record counts in 2020-2021 to previous years. However, because the number of records added to the ALA has grown each year, comparing raw counts would make 2020-2021 look busier than previous years simply because of that overall growth.

To avoid this, let’s scale our record counts by the total number of Anatidae records each year so that we compare proportions rather than raw numbers.
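To make the scaling concrete, here's the arithmetic for the first row of our weekly data (647 records in week 1 of 2017, out of 37,194 records for 2017 overall, a total we download below):

```r
# Proportion of 2017's Anatidae records that arrived in week 1
prop_week1_2017 <- 647 / 37194
round(prop_week1_2017, 4)
# 0.0174
```

This matches the first prop value shown in ducks_scaled below.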

First let’s download the total Anatidae records for each year.

ducks_yearly <- 
  galah_call() |>    
  galah_identify("Anatidae") |> 
  galah_filter(cl10929 == "GREATER MELBOURNE", 
               year >= 2017, year <= 2021) |> 
  galah_group_by(year) |>
  atlas_counts() |>
  rename(year_total = count) |>
  mutate(year = as.numeric(year)) |>
  arrange(year)
  
ducks_yearly
# A tibble: 5 × 2
   year year_total
  <dbl>      <int>
1  2017      37194
2  2018      48696
3  2019      47874
4  2020      48850
5  2021      57520

Now we’ll split ducks_weekly into 5 data.frames for each year and divide the record count by the total for each year. At the end we’ll bind the separate data.frames into one again.

ducks_scaled <- ducks_weekly %>%
  split(.$year) %>%
  map2(.x = .,
       .y = ducks_yearly$year_total, 
       ~ .x %>%
         mutate(prop = .x$week_count / .y) %>%
         select(-week_count)
  ) %>%
  bind_rows()

ducks_scaled
# A tibble: 265 × 3
# Groups:   year [5]
    year  week   prop
   <dbl> <dbl>  <dbl>
 1  2017     1 0.0174
 2  2017     2 0.0180
 3  2017     3 0.0179
 4  2017     4 0.0212
 5  2017     5 0.0200
 6  2017     6 0.0156
 7  2017     7 0.0177
 8  2017     8 0.0183
 9  2017     9 0.0155
10  2017    10 0.0145
# … with 255 more rows
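As an aside, the split()/map2() pattern above relies on the split data.frames lining up with the row order of ducks_yearly. An equivalent join-based sketch (assuming the tidyverse is loaded as above, and using the same column names) makes that pairing explicit:

```r
# Alternative: attach each year's total with a join, then divide
ducks_scaled_alt <- ducks_weekly |>
  left_join(ducks_yearly, by = "year") |>
  mutate(prop = week_count / year_total) |>
  select(year, week, prop)
```

Because left_join() matches rows by the year column, the result doesn't depend on split() ordering years to match arrange(year).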

In our final plot we want to compare the “normal” number of records collected to those in 2020 and 2021, and it would be nice to represent the “normal” number as one line representing the average in “normal” years. With that goal in mind, we’ll also calculate the mean proportion of records each week from 2017-2019.

To do this, we’ll place our weekly proportions in separate columns with pivot_wider() and calculate the mean proportion of observations over 2017-2019.

ducks_scaled <- ducks_scaled %>%
  pivot_wider(names_from = year, 
              values_from = prop, 
              names_sort = TRUE,
              names_glue = "year_{year}") |>
  rowwise() |>
  mutate(
    mean_2017_19 = mean(c_across(year_2017:year_2019))
    ) |>
  select(-year_2017, -year_2018, -year_2019)

ducks_scaled
# A tibble: 53 × 4
# Rowwise: 
    week year_2020 year_2021 mean_2017_19
   <dbl>     <dbl>     <dbl>        <dbl>
 1     1    0.0283    0.0218       0.0223
 2     2    0.0191    0.0186       0.0221
 3     3    0.0251    0.0202       0.0200
 4     4    0.0243    0.0175       0.0199
 5     5    0.0159    0.0203       0.0183
 6     6    0.0175    0.0205       0.0156
 7     7    0.0147    0.0157       0.0180
 8     8    0.0154    0.0190       0.0173
 9     9    0.0128    0.0184       0.0144
10    10    0.0117    0.0167       0.0161
# … with 43 more rows

Now that we have our dataset, it needs a bit of reorganising to make it suitable for plotting. Counts are grouped by week of the year, meaning we have two sets of week numbers (one for 2020, one for 2021). To plot these in order along a single axis, we'll keep weeks 1-53 for 2020 and shift 2021's weeks to 54-106.

# 2021 record count proportions
ducks_scaled_2021 <- ducks_scaled |>
  rename(prop = year_2021) |>
  mutate(week = 53 + week)

# 2020 + 2021 record count proportions
ducks_final <- ducks_scaled |>
  rename(prop = year_2020) |>
  bind_rows(ducks_scaled_2021)

ducks_final
# A tibble: 106 × 5
# Rowwise: 
    week   prop year_2021 mean_2017_19 year_2020
   <dbl>  <dbl>     <dbl>        <dbl>     <dbl>
 1     1 0.0283    0.0218       0.0223        NA
 2     2 0.0191    0.0186       0.0221        NA
 3     3 0.0251    0.0202       0.0200        NA
 4     4 0.0243    0.0175       0.0199        NA
 5     5 0.0159    0.0203       0.0183        NA
 6     6 0.0175    0.0205       0.0156        NA
 7     7 0.0147    0.0157       0.0180        NA
 8     8 0.0154    0.0190       0.0173        NA
 9     9 0.0128    0.0184       0.0144        NA
10    10 0.0117    0.0167       0.0161        NA
# … with 96 more rows

Lockdowns

During the height of the pandemic, Melbourne had 6 distinct lockdowns. Let’s add their start and end dates to a tibble.

Because we want to plot 2020 and 2021 on the same plot, we’ll use ifelse() to add 54 to weeks in 2021 so that our week numbers extend continuously from the beginning of 2020 to the end of 2021.

n_lockdown <- c(1:6)
start_date <- c("2020-03-31", "2020-07-09",
                "2021-02-13", "2021-05-28",
                "2021-07-16", "2021-08-05")
end_date <- c("2020-05-12", "2020-10-27",
              "2021-02-17", "2021-06-10",
              "2021-07-27", "2021-10-21")

lockdowns <- tibble(n_lockdown, start_date, end_date) |>
  mutate(
    n_days = as_date(ymd(end_date)) - as_date(ymd(start_date)),
    week_start = ifelse(year(start_date) == 2020, 
                        week(start_date), week(start_date) + 54),
    week_end = ifelse(year(end_date) == 2020, 
                      week(end_date), week(end_date) + 54),
    )
lockdowns 
# A tibble: 6 × 6
  n_lockdown start_date end_date   n_days   week_start week_end
       <int> <chr>      <chr>      <drtn>        <dbl>    <dbl>
1          1 2020-03-31 2020-05-12  42 days         13       19
2          2 2020-07-09 2020-10-27 110 days         28       43
3          3 2021-02-13 2021-02-17   4 days         61       61
4          4 2021-05-28 2021-06-10  13 days         76       77
5          5 2021-07-16 2021-07-27  11 days         83       84
6          6 2021-08-05 2021-10-21  77 days         85       96
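As a quick sanity check of the week arithmetic, we can recompute week_start for lockdown 3 (which started on 2021-02-13) by hand:

```r
library(lubridate)

# 2021-02-13 falls in week 7 of 2021; adding the 54-week offset gives 61,
# matching week_start in the lockdowns table above
week(ymd("2021-02-13")) + 54
# 61
```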

Make plot

To help us see the components of our final plot more clearly, let’s construct our visualisation step-by-step.

First, we’ll add our lockdown dates as highlighted rectangular blocks. To do this we can use geom_rect() and set xmin and xmax values to the week_start and week_end columns in lockdowns. We’ll make each rectangle span the full height of the plot by setting ymin = 0 and ymax = Inf.

Setting our fill inside of aes() and defining its value within scale_fill_manual() allows us to add the yellow highlighted colour to its own legend, which will be useful later.

p1 <- ggplot() +
  geom_rect(data = lockdowns,
            aes(xmin = week_start,
                xmax = week_end,
                fill = "Lockdown"),
            ymin = 0,
            ymax = Inf,
            alpha = 0.2) +
  scale_fill_manual(
    values = c("Lockdown" = pilot_color("yellow")))
p1

Next we’ll add our species observation counts as lines. We’ll also define their colours and edit the legend and axis labels.

The plot created after these steps shows everything we need to know about our data. You could stop here if you wished, especially if you were only making this plot for yourself.

p2 <- p1 +
  # add lines
  geom_line(data = ducks_final, 
            aes(x = week, y = prop, 
            color = "2020-21 Records"), 
            linewidth = 1) + 
  geom_line(data = ducks_final, 
            aes(x = week, y = mean_2017_19, 
            color = "2017-19 Average"),
            linetype = "twodash", 
            linewidth = 0.8) + 
  # add fill
  geom_area(data = ducks_final, 
            aes(x = week, y = prop),
            fill = pilot_color("blue"), 
            alpha = 0.3) + 
  scale_color_manual(values = c(pilot_color("orange"),
                                pilot_color("blue")), 
                     labels = c("2017-19 average", 
                                "2020-21 occurrences")) +
  guides(fill = guide_legend(title = ""), 
         color = guide_legend(title = "Year")) +
  labs(y = "Proportion of ducks recorded",
       x = "Month",
       title = "Duck, geese & swan observations in Melbourne (2020-21 vs. 2017-19)",
       subtitle = "Weekly ALA records during and prior to COVID-19",
       caption = "Weekly proportions scaled by total records for that year")
p2

Here’s a more refined visualisation with nice fonts, axis scales, axis labels and titles:

(feel free to use this code for your own plots!)

Code
# add fonts
font_add_google("Montserrat", family = "mont")
font_add_google("Hind", family = "hind")  
showtext_auto(enable = TRUE)

p2 + 
  # make axis scales understandable
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 0.035),
                     labels = scales::percent_format()) +
  scale_x_continuous(expand = c(0, 0),
                     limits = c(1, 106),
                     breaks = c(1, 14, 27, 40, 53, 66, 79, 92),
                     labels = c("Jan", "Apr", "Jul", "Oct", "Jan", "Apr", "Jul", "Oct")) +
  # add year x axis labels
  coord_cartesian(clip = "off") +
  annotate_pilot(label = "2020", x = 27, y = 0, 
                 alpha = 0.7, vjust = 3.8, size = 10) +
  annotate_pilot(label = "2021", x = 79, y = 0, 
                 alpha = 0.7, vjust = 3.8, size = 10) +
  labs(title = "Duck, geese & swan observations in Melbourne (2020-21 vs. 2017-19)") +
  theme_pilot(grid = "",
              axes = "bl") +
  theme(axis.title.x = element_text(size = 23, vjust = -1.3),
        axis.title.y = element_text(size = 23),
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.line = element_line(linewidth = 0.5),
        legend.text = element_text(size = 23),
        legend.title = element_text(size = 20),
        plot.caption = element_text(size = 18),
        text = element_text(family = "hind"),
        plot.title = element_markdown(family = "mont", size = 31),
        plot.subtitle = element_text(family = "mont", size = 28))

Now that we have our final plot, we can see a few interesting trends:

  1. Species observations were lower than the 2017-2019 average during the first lockdown (soon after COVID-19 arrived in Australia).
  2. Species observations were higher than the 2017-2019 average during the 2 longer lockdowns.
  3. Observations increased in the last half of all lockdowns except the first lockdown.
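To put rough numbers on these trends (a sketch only; the in_long_lockdown helper and its grouping are not part of the original analysis), we could flag the weeks that fall inside the two longest lockdowns (2 and 6) and compare mean weekly proportions against the 2017-19 baseline:

```r
# Flag weeks inside the two longest lockdowns (numbers 2 and 6)
long_lockdowns <- filter(lockdowns, n_lockdown %in% c(2, 6))

in_long_lockdown <- function(w) {
  any(w >= long_lockdowns$week_start & w <= long_lockdowns$week_end)
}

ducks_final |>
  ungroup() |>   # drop the rowwise grouping before summarising
  mutate(long_lockdown = map_lgl(week, in_long_lockdown)) |>
  group_by(long_lockdown) |>
  summarise(mean_prop = mean(prop, na.rm = TRUE),
            mean_baseline = mean(mean_2017_19, na.rm = TRUE))
```

If trend 2 holds, mean_prop should sit above mean_baseline in the long_lockdown = TRUE row.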

Are these trends that you expected to see?

It’s impossible to say why these trends emerged from our data visualisation alone, but we can speculate (for fun).

Were people making fewer observations in the first lockdown because they were preoccupied with other priorities as they settled into new working-from-home routines? Did people make more observations at the tail end of lockdowns because they were spending more time by natural ponds and lakes, seeking anxiety relief as they grew weary spending time at home?

Some evidence from the US found more people were using natural spaces during COVID-19, and time in these spaces lowered depression and anxiety. A New Zealand study found similar results.

Final thoughts

This post has detailed how you can use ALA data to explore relationships between record counts and real-world events. Although we can’t make any causal claims from what we see in our plot, it’s a nice way to do some exploratory analysis on a lot of data!

Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.2 (2022-10-31 ucrt)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2023-03-15
 pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 dplyr       * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
 galah       * 1.5.1   2023-02-21 [1] Github (AtlasOfLivingAustralia/galah@bd43dd2)
 ggplot2     * 3.4.1   2023-02-10 [1] CRAN (R 4.2.2)
 ggtext      * 0.1.2   2022-09-16 [1] CRAN (R 4.2.2)
 htmltools   * 0.5.4   2022-12-07 [1] CRAN (R 4.2.2)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.2)
 pilot       * 4.0.0   2022-07-13 [1] Github (olihawkins/pilot@f08cc16)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.2)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 showtext    * 0.9-5   2022-02-09 [1] CRAN (R 4.2.1)
 showtextdb  * 3.0     2020-06-04 [1] CRAN (R 4.2.1)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.2)
 sysfonts    * 0.8.8   2022-03-13 [1] CRAN (R 4.2.1)
 tibble      * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.2)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.2.2/library

──────────────────────────────────────────────────────────────────────────────